Introduction to Pandas
Pandas is a powerful, open-source data analysis and manipulation library for Python. It provides data structures and functions needed to work with structured data seamlessly. Built on top of NumPy, pandas is the cornerstone of data analysis in Python.
Data Structures
Series and DataFrame for efficient data handling
Data Manipulation
Powerful tools for reshaping and pivoting data
I/O Support
Read/write data from CSV, Excel, SQL, JSON, and more
Data Cleaning
Handle missing data and duplicates efficiently
Performance
Fast operations on large datasets
Analysis Tools
Statistical functions and aggregations
Installation & Setup
Install Pandas
# Install via pip
pip install pandas
# Install with NumPy and other dependencies
pip install pandas numpy matplotlib
# Install with conda
conda install pandas
Import Pandas
import pandas as pd
import numpy as np
# Check version
print(pd.__version__)
pd is the standard alias used throughout the data science community.
Pandas Series
A Series is a one-dimensional labeled array capable of holding any data type. Think of it as a single column in a spreadsheet or a DataFrame.
Creating Series
# From a list
s = pd.Series([1, 3, 5, 7, 9])
print(s)
# With custom index
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# From a dictionary
data = {'a': 10, 'b': 20, 'c': 30}
s = pd.Series(data)
# From NumPy array
s = pd.Series(np.random.randn(5))
Series Operations
s = pd.Series([1, 2, 3, 4, 5])
# Accessing elements
print(s[0]) # First element (by label; s.iloc[0] is the explicit positional form)
print(s[1:4]) # Slicing by position (end-exclusive)
# Arithmetic operations
print(s + 10) # Add 10 to all elements
print(s * 2) # Multiply all by 2
# Statistical operations
print(s.mean()) # Mean
print(s.sum()) # Sum
print(s.max()) # Maximum
print(s.std()) # Standard deviation
Pandas DataFrames
A DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. Think of it as a spreadsheet or SQL table.
Creating DataFrames
# From a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'Paris', 'London']
}
df = pd.DataFrame(data)
# From a list of dictionaries
data = [
    {'name': 'Alice', 'age': 25},
    {'name': 'Bob', 'age': 30}
]
df = pd.DataFrame(data)
# From NumPy array
df = pd.DataFrame(
    np.random.randn(4, 3),
    columns=['A', 'B', 'C']
)
DataFrame Attributes
# View first/last rows
df.head() # First 5 rows
df.tail(3) # Last 3 rows
# Basic information
df.shape # (rows, columns)
df.columns # Column names
df.index # Row indices
df.dtypes # Data types of columns
# Summary statistics
df.info() # Detailed info
df.describe() # Statistical summary
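Putting the attributes together on a concrete DataFrame (the sample data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
})

print(df.shape)          # (3, 2) -> 3 rows, 2 columns
print(list(df.columns))  # ['name', 'age']
print(df.dtypes)         # object for 'name', int64 for 'age'
print(df.describe())     # summary stats for the numeric 'age' column
```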
Reading and Writing Data
Reading Data
# Read CSV file
df = pd.read_csv('data.csv')
# Read with specific options
df = pd.read_csv('data.csv',
                 sep=',',
                 header=0,
                 index_col=0,
                 parse_dates=['date_column'])
# Read Excel file
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# Read JSON
df = pd.read_json('data.json')
# Read from SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)
# Read HTML tables
dfs = pd.read_html('https://example.com/table.html')
# Read clipboard
df = pd.read_clipboard()
Writing Data
# Write to CSV
df.to_csv('output.csv', index=False)
# Write to Excel
df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
# Write to JSON
df.to_json('output.json', orient='records')
# Write to SQL
df.to_sql('table_name', conn, if_exists='replace')
# Write to HTML
df.to_html('output.html')
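A quick round trip ties the reading and writing halves together. This sketch writes to a temporary directory so it doesn't clobber any real files; note that index=False on write pairs with the default header-based read:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Write without the row index, then read the file back
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path, index=False)
df2 = pd.read_csv(path)

# The round trip preserves the data
print(df.equals(df2))
```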
Data Selection & Indexing
Column Selection
# Select single column
df['column_name']
# Select multiple columns
df[['col1', 'col2']]
# Using dot notation (only works when the column name is a valid
# Python identifier and doesn't clash with a DataFrame method)
df.column_name
Row Selection
# Select by label (loc)
df.loc[0] # Single row
df.loc[0:3] # Multiple rows (inclusive)
df.loc[0, 'name'] # Specific cell
# Select by position (iloc)
df.iloc[0] # First row
df.iloc[0:3] # First 3 rows
df.iloc[0, 1] # Row 0, Column 1
# Boolean indexing
df[df['age'] > 25]
df[(df['age'] > 25) & (df['city'] == 'Paris')]
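The difference between loc and iloc slicing is a common stumbling block: loc slices are inclusive of the end label, while iloc slices exclude the end position, like ordinary Python slicing. A small sketch with an invented labeled index:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35]}, index=['a', 'b', 'c'])

# loc: label-based, end label INCLUDED -> rows 'a' and 'b'
print(df.loc['a':'b'])

# iloc: position-based, end position EXCLUDED -> positions 0 and 1
print(df.iloc[0:2])
```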
Conditional Selection
# Query method
df.query('age > 25 and city == "Paris"')
# isin method
df[df['city'].isin(['Paris', 'London'])]
# String contains
df[df['name'].str.contains('Alice')]
# Between values
df[df['age'].between(25, 35)]
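Combining these filters on a concrete DataFrame (sample data invented for illustration). Note that boolean masks must be combined with & and |, wrapped in parentheses; Python's plain `and`/`or` do not work element-wise:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['New York', 'Paris', 'London'],
})

# Rows where age > 25 AND city is one of the listed values
subset = df[(df['age'] > 25) & (df['city'].isin(['Paris', 'London']))]
print(subset['name'].tolist())
```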
Data Cleaning
Handling Missing Data
# Check for missing values
df.isnull() # Returns boolean DataFrame
df.isnull().sum() # Count missing per column
df.notnull() # Opposite of isnull
# Drop missing values
df.dropna()
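The counting and dropping steps above can be sketched end to end on a small frame with deliberately missing values (data invented for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [np.nan, np.nan, 6],
})

# Count missing values per column: a has 1, b has 2
print(df.isnull().sum())

# dropna() keeps only rows with NO missing values -> just the last row
cleaned = df.dropna()
print(len(cleaned))
```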